Exploring Continuous Variables

Author

Deepak Varughese

Objectives

  • In these slides we will understand how to describe numeric variables in your dataset.

  • We will explore the idea of the four moments of a dataset:

    • Measures of Centrality

    • Measures of Spread

    • Skewness

    • Kurtosis

The Dataset

For this we will use a clean version of the data available at https://www.kaggle.com/datasets/datasetengineer/public-health-dataset

This is a fictional public health dataset and is being used only for educational purposes.

Note that the data has already been pre-processed and cleaned. Since data cleaning and processing are not the aim of this tutorial, we will skip these steps.

Loading the libraries and importing data.

pacman::p_load(
  tidyverse,
  here, 
  rio,
  janitor,
  skimr,
  flextable,
  moments
)

data <- import(here("posts5","exercise_data_1.csv"))

data_clean <- data %>%
  mutate(across(
    c(vaccination_status, compliance_with_health_guidelines),
    ~ factor(
      case_when(
        .x == 1 ~ "Yes",
        .x == 0 ~ "No"
      ), 
      levels = c("No", "Yes") 
    )))

In the above steps we have imported the dataset and then converted some of the variables to the correct data types. We now have a new object called data_clean that contains the clean dataset, ready for analysis. Let us have a look at the dataset.

head(data_clean) %>% 
  flextable()

age  gender  location  ethnicity   ses  vaccination_status  temperature  aqi  humidity  travel_history  social_activity  compliance_with_health_guidelines  reported_symptoms  testing_results  disease_severity  hospitalization_requirement
51   Male    Urban     Ethnicity3  Low  No                  29.31228     133  31.51287  No Travel       Low              Yes                                Mild               Negative         Mild              Requires Hospitalization
92   Male    Urban     Ethnicity2  Low  No                  35.98183     121  76.88069  No Travel       Medium           Yes                                Severe             Negative         Mild              Requires Hospitalization
14   Male    Urban     Ethnicity3  Low  Yes                 28.17669     42   86.36421  No Travel       Low              No                                 None               Negative         Moderate          No Hospitalization
71   Female  Urban     Ethnicity2  Low  No                  18.59550     142  28.82366  No Travel       Medium           No                                 Moderate           Negative         Severe            No Hospitalization
60   Female  Urban     Ethnicity1  Low  No                  38.27969     231  39.47306  No Travel       Medium           Yes                                Mild               Negative         Mild              No Hospitalization
20   Female  Urban     Ethnicity1  Low  Yes                 10.86268     10   85.72899  International   Low              Yes                                Mild               Negative         Mild              No Hospitalization

We can now see that the dataset has 16 variables: 12 of these are categorical and 4 are numeric. We explored this in the last tutorial.

Exploring Numeric/Continuous Variables

A dataset with a continuous variable has rows and rows of data. How do we summarize this data so that it makes sense to us? To describe that variable, we need a way to collapse all 43,000 rows into a meaningful summary. Only then can we transform these rows into insight that can benefit people.

We usually do this with the following

  • Measures of Centrality and Location

  • Measures of Variability

  • Measures of Shape of the data

    Measures of Centrality: Finding the “Heart” of Your Data

    When we have 43,000 rows of data, we need a single value that represents the “typical” observation. This is the Measure of Centrality.

    The Mean

    • What is it? It is the sum of all values divided by the total number of observations (\(n\)). It represents the mathematical “balance point” of your data.

    • When to use it: Use the mean when your data is symmetric and has no extreme outliers (e.g., adult height or blood pressure in a healthy population).

    • The “Why” : Most clinical trials report the mean because it is mathematically powerful—it allows for advanced hypothesis testing (like t-tests).

    • The Limitation: It is not “robust.”

      Example: You create a 7-minute health education video. If the Mean View Time is 3 minutes, it doesn’t necessarily mean most people watched half the video. It could mean 90% of people clicked away in 5 seconds, but 10% watched it 10 times over. The mean is “dragged” by those few enthusiasts.

    The Median

    • What is it? The middle value when all observations are ranked from lowest to highest. It splits your dataset exactly in half.

    • When to use it: Use the median when your data is skewed or contains outliers.

    • The “Why” (Biomedical Context): In medicine, we often look at “Median Survival Time” or “Length of Stay.”

      Example: In a hospital ward, 10 patients stay for 3 days, but one patient with complications stays for 90 days.

      • Mean: ~11 days (misleading; suggests the ward is inefficient).

      • Median: 3 days (accurate; reflects the experience of the “average” patient).

    The Trimmed Mean

    • What is it? A mean calculated after discarding a small percentage (usually 5% or 10%) of the lowest and highest values.

    • When to use it: Use it when you want to utilize more data than the median provides, but you want to protect the result from being skewed by “freak” occurrences or measurement errors.

    • The “Why” (Biomedical Context): It is used in situations where “bias” or “noise” is expected at the extremes.

      Example: In Olympic diving or gymnastics, the highest and lowest scores are dropped before averaging. In a lab setting, if you run a blood test 10 times and one result is wildly different due to a technical glitch, a trimmed mean ensures that one “glitch” doesn’t ruin your entire average.
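The three measures can be compared directly in base R. Here is a minimal sketch using the hospital length-of-stay example above: ten patients staying 3 days plus one outlier staying 90 days.

```r
# Length of stay: ten routine patients plus one 90-day outlier
stay_days <- c(rep(3, 10), 90)

mean(stay_days)              # ~10.9 days, dragged up by the single outlier
median(stay_days)            # 3 days, the typical patient's experience
mean(stay_days, trim = 0.1)  # trimmed mean: drops 10% from each end, giving 3
```

Note how the trimmed mean discards the 90-day stay (and one 3-day stay at the other extreme) before averaging, so it agrees with the median here.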

    Summary Table

    Measure Sensitive to Outliers? Best Used For…
    Mean Yes (Highly) Normal, bell-shaped data.
    Median No (Robust) Skewed data (Income, Recovery time).
    Trimmed Mean No (Limited) Removing “noise” or bias from the extremes.

    Measures of Variability: How Spread Out is the Data?

    If centrality tells us where the middle is, variability tells us how much the individual patients differ from that middle. High variability means the “average” is less representative of any single patient.

    How do we calculate the “average distance of each value from the mean”?

    Can we simply add up the distances of each point from the mean and then divide by the number of observations? That would seem like the logical way to calculate the average distance from the mean.

    Actually no. Let us consider the following scenario.

    # 1. Create the dataset
    example_data <- data.frame(
      patient_id = as.character(1:10),
      value = c(2, 3, 4, 4, 5, 5, 6, 6, 7, 8)
    )
    
    # 2. Calculate each value's deviation from the mean
    table_df <- example_data %>%
      mutate(
        mean = mean(value),
        distance_from_mean = value - mean
      ) %>% 
      flextable()
    
    table_df

    patient_id  value  mean  distance_from_mean
    1           2      5     -3
    2           3      5     -2
    3           4      5     -1
    4           4      5     -1
    5           5      5      0
    6           5      5      0
    7           6      5      1
    8           6      5      1
    9           7      5      2
    10          8      5      3

    If we add up all the distances from the mean and take an average, we will actually get zero!

    This is because every positive difference from the mean is balanced by a negative difference. The net difference stays at zero simply because that is how addition works.
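We can verify this directly with the ten example values from the table above:

```r
# The same ten values used in the example table (mean = 5)
value <- c(2, 3, 4, 4, 5, 5, 6, 6, 7, 8)

# The deviations above and below the mean cancel out exactly
sum(value - mean(value))  # 0 (up to floating-point error)
```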

    To counter this we need a strategy to remove the negative sign from the values below the mean.

    To do this we can use the following mathematical approaches:

  • Square the values - Variance and Standard Deviation

  • Take the absolute values and remove the negative sign - Mean Absolute Deviation / Median Absolute Deviation

  • Use a location based estimate similar to the median (Interquartile Range)

    Variance (\(s^2\)) and Standard Deviation

    • What is it?

      Take each value’s distance from the mean and square it; squaring automatically removes the negative sign. Add up these squared distances and divide by the number of observations (for a sample, statisticians divide by \(n - 1\) instead). This is your variance. It is, however, difficult to interpret, as the scale and units are now in “square units”. To convert it back to the original scale we take the square root of the variance. This is our Standard Deviation.

    • When to use it: Variance is rarely reported directly; however, it powers many statistical tests. The more commonly used statistic is the standard deviation. Whenever the mean is reported, it should always be accompanied by the standard deviation.

    • The “Why”: The Standard Deviation is much more sensitive to values that are far from the mean. Because you are squaring the distances, a point that is 4 units away contributes 16 to the variance, while a point 2 units away contributes only 4. This is useful in cases where occasional large fluctuations carry important information.

      Example: Imagine you are monitoring a patient’s heart rate. If the rate fluctuates slightly, it’s fine. But if it occasionally spikes wildly, you want a metric that “screams” when those spikes happen. SD screams louder than MAD because it squares those extreme gaps, alerting the researcher to high instability.
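Here is a quick sketch of the calculation using the same ten example values. One caution: R’s built-in var() and sd() divide by \(n - 1\) (the sample versions), while dividing by the number of observations gives the population versions described above.

```r
value <- c(2, 3, 4, 4, 5, 5, 6, 6, 7, 8)
n <- length(value)

# Squaring each deviation removes the negative sign
squared_dev <- (value - mean(value))^2

pop_var <- sum(squared_dev) / n  # population variance: 3
pop_sd  <- sqrt(pop_var)         # square root brings us back to original units

var(value)  # sample variance, divides by n - 1: ~3.33
sd(value)   # sample standard deviation: ~1.83
```

For large datasets the two versions are nearly identical; the distinction matters mostly for small samples.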

      Mean Absolute Deviation (MAD) or Median Absolute Deviation (MedAD)

      • What is it?

        These measures use another approach to get rid of the negative sign. They simply use the absolute values ignoring the negative sign. They look at the absolute distance of each point from the center (Mean or Median) without squaring them.

      • The “Why” and When

        Squaring the differences (as in Variance/SD) gives massive weight to outliers. MAD does not do this and gives a more even estimate. When to use what will depend on how important the outliers are to the dataset.

        MAD was traditionally not used much in statistics because using it in inferential statistics can be computationally intensive. Since traditional statistics was done with calculus, pen, and paper, the standard deviation was always used. However, MAD can be a more intuitive measure and is quickly gaining importance in data science, especially in finance and other domains.
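A sketch of both versions on the same ten values. One caution: R’s built-in mad() computes the median absolute deviation and, by default, scales it by a constant (1.4826) so that it matches the SD for normally distributed data; pass constant = 1 for the raw value. Base R has no function for the mean absolute deviation, so we compute it directly.

```r
value <- c(2, 3, 4, 4, 5, 5, 6, 6, 7, 8)

# Mean absolute deviation: average absolute distance from the mean
mean(abs(value - mean(value)))  # 1.4

# Median absolute deviation, without the normal-consistency scaling
mad(value, constant = 1)        # 1

mad(value)  # default: scaled by 1.4826 to be comparable to the SD
```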

    Interquartile Range (IQR)

    • What is it?

      Suppose we were to list all the values in the dataset in ascending order and divide them into 4 equal groups. The first group holds the lowest 25% of the values, the second the 25th to 50th percentiles, the third the 50th to 75th percentiles, and the last the 75th to 100th. These are the 4 quartile groups of the variable. The Interquartile Range is the distance between the 25th percentile (where the 2nd group starts) and the 75th percentile (where the 3rd group ends). Essentially it contains the middle 50% of the values.

    • When to use it: Always use this when you are reporting a Median. It is the “robust” partner to the median.

    • The “Why”: Like the median, the IQR is not affected by extreme outliers.

      patient_data <- data.frame(
        Value = c(2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 15, 18, 22, 30)
      ) %>% 
        arrange(Value) %>%
        mutate(Patient_ID = row_number())
      
      
      patient_data <- patient_data %>%
        mutate(
          Quartile = case_when(
            Patient_ID <= 4  ~ "Q1 (0-25%)",
            Patient_ID <= 8  ~ "Q2 (25-50%)",
            Patient_ID <= 12 ~ "Q3 (50-75%)",
            TRUE             ~ "Q4 (75-100%)"
          )
        )
      
      # 3. Create the table
      patient_data %>%
        select(Patient_ID, Value, Quartile) %>%
        flextable() %>%
        set_header_labels(
          Patient_ID = "Patient Rank",
          Value = "Recovery Days",
          Quartile = "Quartile Group"
        ) %>%
        # Color-code the quartiles to show the 4 distinct groups
        bg(i = 1:4, bg = "#E1F5FE", part = "body") %>%
        bg(i = 5:8, bg = "#B3E5FC", part = "body") %>%
        bg(i = 9:12, bg = "#81D4FA", part = "body") %>%
        bg(i = 13:16, bg = "#4FC3F7", part = "body") %>%
        # Highlight the boundaries for IQR (Q1 and Q3)
        bold(i = c(4, 12), j = 2) %>%
        add_footer_lines("The IQR is the range between the end of Q1 (4 days) and the end of Q3 (12 days).") %>%
        autofit()

      Patient Rank  Recovery Days  Quartile Group
      1             2              Q1 (0-25%)
      2             3              Q1 (0-25%)
      3             3              Q1 (0-25%)
      4             4              Q1 (0-25%)
      5             5              Q2 (25-50%)
      6             5              Q2 (25-50%)
      7             6              Q2 (25-50%)
      8             7              Q2 (25-50%)
      9             8              Q3 (50-75%)
      10            9              Q3 (50-75%)
      11            10             Q3 (50-75%)
      12            12             Q3 (50-75%)
      13            15             Q4 (75-100%)
      14            18             Q4 (75-100%)
      15            22             Q4 (75-100%)
      16            30             Q4 (75-100%)

      The IQR is the range between the end of Q1 (4 days) and the end of Q3 (12 days).

      Example:

      Consider the above example, which looks at days of recovery for in-patients at a ward. Notice how the IQR of 4–12 days was calculated.
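We can check this with base R. One caveat: quantile() supports several interpolation methods (the type argument), so the exact cut-points can differ slightly from the simple “count off groups of four” picture above; with the default method, the IQR for these 16 values still comes out to 8 days.

```r
recovery_days <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 15, 18, 22, 30)

# The 25th and 75th percentiles (default interpolation gives 4.75 and 12.75)
quantile(recovery_days, c(0.25, 0.75))

# IQR = 75th percentile - 25th percentile
IQR(recovery_days)  # 8
```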

    Coefficient of Variation (CV)

    • What is it?

      The Coefficient of Variation is a bit different from everything we have discussed. It is not a measure of variability in the same sense as the other measures here. It is the ratio of the Standard Deviation to the Mean \((SD / Mean)\), often expressed as a percentage.

    • When to use it:

      The CV is used to compare the variability of two different variables that have different units or wildly different scales. The standard deviations of two different variables cannot be compared directly, as they may be measured on different scales and units. We need a way to normalize them: dividing by the mean removes the scale from the equation.

    • The “Why”: It provides a “normalized” measure of spread.

      Suppose you want to know if a patient’s Cholesterol is more unstable than their Body Weight. You can’t compare “mg/dL” to “kg” directly.

      • Weight: Mean = 80kg, SD = 2kg \(\rightarrow\) CV = 2.5%

      • Cholesterol: Mean = 200mg/dL, SD = 20mg/dL \(\rightarrow\) CV = 10%

      Interpretation: Even though the SD for cholesterol (20) looks much bigger than the SD for weight (2), the cholesterol is actually four times more variable relative to its average.

      # Comparing Weight vs Cholesterol Variability
      comparison <- data.frame(
        Metric = c("Body Weight (kg)", "Cholesterol (mg/dL)"),
        Mean = c(80, 200),
        SD = c(2, 20)
      ) %>%
        mutate(CV = (SD / Mean) * 100)
      
      comparison %>%
        flextable() %>%
        set_header_labels(CV = "CV (%)") %>%
        colformat_double(digits = 1) %>%
        set_caption("Table: Using CV to Compare Variability Across Different Scales")

      Metric               Mean   SD    CV (%)
      Body Weight (kg)     80.0   2.0   2.5
      Cholesterol (mg/dL)  200.0  20.0  10.0

    Summary Table for Students

    Measure Pair it with Unit of Measure Best for…
    Standard Deviation Mean Same as data (e.g., mmHg) Normal distributions.
    IQR Median Same as data (e.g., days) Skewed data/Clinical stay.
    CV N/A Percentage (%) Comparing consistency across different scales.

    We will look at measures of the shape of the data (skewness and kurtosis) in the next tutorial.